Should genes with missing data be excluded from phylogenetic analyses?

Authors

  • Wei Jiang
  • Si-Yun Chen
  • Hong Wang
  • De-Zhu Li
  • John J Wiens
Abstract

Phylogeneticists often design their studies to maximize the number of genes included but minimize the overall amount of missing data. However, few studies have addressed the costs and benefits of adding characters with missing data, especially for likelihood analyses of multiple loci. In this paper, we address this topic using two empirical data sets (in yeast and plants) with well-resolved phylogenies. We introduce varying amounts of missing data into varying numbers of genes and test whether the benefits of excluding genes with missing data outweigh the costs of excluding the non-missing data that are associated with them. We also test if there is a proportion of missing data in the incomplete genes at which they cease to be beneficial or harmful, and whether missing data consistently bias branch length estimates. Our results indicate that adding incomplete genes generally increases the accuracy of phylogenetic analyses relative to excluding them, especially when there is a high proportion of incomplete genes in the overall dataset (and thus few complete genes). Detailed analyses suggest that adding incomplete genes is especially helpful for resolving poorly supported nodes. Given that we find that excluding genes with missing data often decreases accuracy relative to including these genes (and that decreases are generally of greater magnitude than increases), there is little basis for assuming that excluding these genes is necessarily the safer or more conservative approach. We also find no evidence that missing data consistently bias branch length estimates.
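The experimental manipulation described in the abstract (introducing varying amounts of missing data into varying numbers of genes) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `mask_genes` and `proportion_missing`, the dictionary layout of the alignments, and the choice to blank out whole taxa within a gene are all assumptions made for illustration; the abstract does not specify how missing data cells were distributed.

```python
import random


def mask_genes(alignments, n_incomplete, missing_fraction, seed=0):
    """Return a copy of `alignments` with missing data introduced.

    `alignments` maps gene name -> {taxon name -> aligned sequence}.
    `n_incomplete` genes are picked at random; in each of them, a share
    `missing_fraction` of the taxa has its sequence replaced by '?'.
    (Hypothetical scheme: whole-taxon masking is an assumption.)
    """
    rng = random.Random(seed)
    genes = sorted(alignments)
    chosen = set(rng.sample(genes, n_incomplete))
    masked = {}
    for gene in genes:
        seqs = dict(alignments[gene])  # shallow copy; input stays intact
        if gene in chosen:
            taxa = sorted(seqs)
            n_missing = round(missing_fraction * len(taxa))
            for taxon in rng.sample(taxa, n_missing):
                seqs[taxon] = "?" * len(seqs[taxon])
        masked[gene] = seqs
    return masked


def proportion_missing(alignments):
    """Fraction of all data cells that are coded as missing ('?')."""
    total = sum(len(s) for seqs in alignments.values() for s in seqs.values())
    missing = sum(s.count("?") for seqs in alignments.values() for s in seqs.values())
    return missing / total


if __name__ == "__main__":
    # Toy data: 4 genes, 4 taxa, 10 sites each (made up for illustration).
    toy = {f"gene{i}": {f"taxon{j}": "ACGTACGTAC" for j in range(4)} for i in range(4)}
    incomplete = mask_genes(toy, n_incomplete=2, missing_fraction=0.5)
    print(proportion_missing(incomplete))
```

Under this scheme, each masked matrix would be analyzed twice, once with the incomplete genes included and once with them excluded, and the resulting trees compared against the well-resolved reference phylogeny.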

Similar resources

How Should Genes and Taxa be Sampled for Phylogenomic Analyses with Missing Data? An Empirical Study in Iguanian Lizards

Targeted sequence capture is becoming a widespread tool for generating large phylogenomic datasets to address difficult phylogenetic problems. However, this methodology often generates datasets in which increasing the number of taxa and loci increases amounts of missing data. Thus, a fundamental (but still unresolved) question is whether sampling should be designed to maximize sampling of taxa...

A Case Study for Effects of Operational Taxonomic Units from Intracellular Endoparasites and Ciliates on the Eukaryotic Phylogeny: Phylogenetic Position of the Haptophyta in Analyses of Multiple Slowly Evolving Genes

Recent multigene phylogenetic analyses have contributed much to our understanding of eukaryotic phylogeny. However, the phylogenetic positions of various lineages within the eukaryotes have remained unresolved or in conflict between different phylogenetic studies. These phylogenetic ambiguities might have resulted from mixtures or integration from various factors including limited taxon samplin...

Phylogenetics of advanced snakes (Caenophidia) based on four mitochondrial genes.

Phylogenetic relationships among advanced snakes (Acrochordus + Colubroidea = Caenophidia) and the position of the genus Acrochordus relative to colubroid taxa are contentious. These concerns were investigated by phylogenetic analysis of fragments from four mitochondrial genes representing 62 caenophidian genera and 5 noncaenophidian taxa. Four methods of phylogeny reconstruction were applied: ...

Missing data and the design of phylogenetic analyses

Concerns about the deleterious effects of missing data may often determine which characters and taxa are included in phylogenetic analyses. For example, researchers may exclude taxa lacking data for some genes or exclude a gene lacking data in some taxa. Yet, there may be very little evidence to support these decisions. In this paper, I review the effects of missing data on phylogenetic analyse...

Do missing data influence the accuracy of divergence-time estimation with BEAST?

Time-calibrated phylogenies have become essential to evolutionary biology. A recurrent and unresolved question for dating analyses is whether genes with missing data cells should be included or excluded. This issue is particularly unclear for the most widely used dating method, the uncorrelated lognormal approach implemented in BEAST. Here, we test the robustness of this method to missing data....

Journal:
  • Molecular Phylogenetics and Evolution

Volume 80

Publication date: 2014